This section offers an in-depth exploration of the structure, statistics, and spread of the data found in our "member model data" and "city model data" tables. It presents a comprehensive snapshot of the data's features and subtleties.
# Dependencies and Setup
import package as pk
Package: Resources loaded. ☑
# Instantiate the 'DataToSpark' class to handle KKBOX data
kkbox = pk.DataToSpark("customer-churn-391917", "kkbox")
# Load specific tables into DataFrames
tables_to_load = ["member_model_main","city_model_main"]
kkbox.load_tables(tables_to_load)
# Retrieve the stored DataFrames
table = kkbox.get_tables()
# Create an instance of the class with models
member_model_df = table["member_model_main"]
city_model_df = table["city_model_main"]
# Call the 'bar_chart' function from plots.py to create a churn bar chart
pk.bar_chart([member_model_df, city_model_df], 'is_churn')
In our dataset, churn represents whether users discontinue their subscription or service. Upon analyzing the member_model and city_model data, we found a pronounced trend towards the "No" churn category: 92.07% in member_model and 95.64% in city_model. This indicates a strong retention rate for the users in both categories.
# City
pk.bar_scatter_chart(member_model_df, 'city')
The member dataset shows that 51.92% of the entries are from a particular city, referred to as 'city 1'. Following this, 'city 13' and 'city 5' have the highest frequencies. Furthermore, there's no observable correlation between the city of origin and the churn rate.
Based on Age Category:
# Call the 'bucketized' function from helpers.py to categorize 'bd' column
member_bd_df=pk.bucketized(member_model_df, 'bd', 'age_group', [1, 13, 20, 35, 65])
pk.bar_scatter_chart(member_bd_df, 'age_group')
city_bd_df=pk.bucketized(city_model_df, 'bd', 'age_group', [1, 13, 20, 35, 65])
pk.bar_scatter_chart(city_bd_df, 'age_group')
Both datasets exhibit an unclear category, with the city dataset displaying the highest frequency in this category. Notably, the age group categorized as Adolescents (13-19 years, labeled as Category 2) showcases the highest churn percentage. This group's churn rate lies outside the typical statistical range for age in these datasets.
Based on Gender Category:
# Gender
pk.bar_scatter_chart(member_model_df, 'gender')
pk.bar_scatter_chart(city_model_df, 'gender')
Both datasets exhibit an unclear category, with the city dataset displaying the highest frequency in this category. Notably, the age group categorized as Adolescents (13-19 years, labeled as Category 2) showcases the highest churn percentage. This group's churn rate lies outside the typical statistical range for age in these datasets.
# registration via "member_model_df"
pk.bar_scatter_chart(member_model_df, 'registered_via')
# registration via "city_model_df"
pk.bar_scatter_chart(city_model_df, 'registered_via')
def generate_charts(model_df):
"""
Generate charts based on the provided DataFrame.
Args:
- model_df (DataFrame): Input DataFrame with 'registration_init_time' column.
Returns:
- tuple: A tuple containing the year-based bar scatter chart and the sub_line chart for month.
"""
# Extract the year from 'registration_init_time' column
date_df = model_df.withColumn("registration_year", pk.year(pk.col("registration_init_time")))
# Filter data to exclude records from the year 2017 and onwards
date_df_filter = date_df.filter(pk.col("registration_year") < 2017)
# Generate a bar scatter chart based on registration year
year_chart = pk.bar_scatter_chart(date_df_filter, 'registration_year')
# Filter data to only include records from the last 4 years (2013-2016)
last_4_df = date_df.filter((pk.col("registration_year") >= 2013) & (pk.col("registration_year") < 2017))
# Extract month and day from 'registration_init_time'
last_4_df = last_4_df.withColumn("registration_month", pk.month("registration_init_time"))
last_4_df = last_4_df.withColumn("registration_day", pk.dayofmonth("registration_init_time"))
# Generate a sub_line chart based on registration month within the years 2013-2016
month_chart = pk.sub_line(last_4_df, ["registration_year", "registration_month"], [2013, 2014, 2015, 2016])
generate_charts(member_model_df)
23/09/14 14:55:15 WARN org.apache.spark.util.Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
generate_charts(city_model_df)
pk.bar_scatter_chart(member_model_df, 'payment_method_id', 0.4)
pk.bar_scatter_chart(city_model_df, 'payment_method_id', 0.4)
member_plan_df=pk.bucketized(member_model_df, 'payment_plan_days', 'plan_group', [31, 91, 181, 366])
pk.bar_scatter_chart(member_plan_df, 'plan_group', 1)
city_plan_df=pk.bucketized(city_model_df, 'payment_plan_days', 'plan_group', [31, 91, 181, 366])
pk.bar_scatter_chart(city_plan_df, 'plan_group', 1)
member_price_df=pk.bucketized(member_model_df, 'plan_list_price', 'price_group', [1, 51, 101, 201])
pk.bar_scatter_chart(member_price_df, 'price_group')
city_price_df=pk.bucketized(city_model_df, 'plan_list_price', 'price_group', [1, 51, 101, 201])
pk.bar_scatter_chart(city_price_df, 'price_group')
member_act_price_df=pk.bucketized(member_model_df, 'actual_amount_paid', 'act_price_group', [1, 51, 101, 201])
pk.bar_scatter_chart(member_act_price_df, 'act_price_group')
city_act_price_df=pk.bucketized(city_model_df, 'actual_amount_paid', 'act_price_group', [1, 51, 101, 201])
pk.bar_scatter_chart(city_act_price_df, 'act_price_group')
pk.bar_scatter_chart(member_model_df, 'is_auto_renew')
pk.bar_scatter_chart(city_model_df, 'is_auto_renew')
pk.bar_scatter_chart(member_model_df, 'is_cancel')
pk.bar_scatter_chart(city_model_df, 'is_cancel')
pk.bar_scatter_chart(member_model_df, 'activity_count')
pk.bar_scatter_chart(city_model_df, 'activity_count')
def sum_plot(df, param, group_param, bins=[101, 1001, 100]):
sum_df=pk.bucketized(df, param, group_param, bins)
pk.bar_scatter_chart(sum_df, group_param)
sum_plot(member_model_df, 'sum_num_25', 'sum_num_25_group')
sum_plot(city_model_df, 'sum_num_25', 'sum_num_25_group')
sum_plot(member_model_df, 'sum_num_50', 'sum_num_50_group')
sum_plot(city_model_df, 'sum_num_50', 'sum_num_50_group')
sum_plot(member_model_df, 'sum_num_75', 'sum_num_75_group')
sum_plot(city_model_df, 'sum_num_75', 'sum_num_75_group')
sum_plot(member_model_df, 'sum_num_985', 'sum_num_985_group')
sum_plot(city_model_df, 'sum_num_985', 'sum_num_985_group')
sum_plot(member_model_df, 'sum_num_100', 'sum_num_100_group', [1,1000,50])
sum_plot(city_model_df, 'sum_num_100', 'sum_num_100_group', [1,1000,50])
sum_plot(member_model_df, 'sum_num_unq', 'sum_num_unq_group', [1,1000,50])
sum_plot(city_model_df, 'sum_num_unq', 'sum_num_unq_group', [1,1000,50])
sum_plot(member_model_df, 'total_secs', 'total_secs_group', [0,1000000,50000])
sum_plot(city_model_df, 'total_secs', 'total_secs_group', [0,1000000,50000])